ABSTRACT

Complex biological functions emerge through intricate protein-protein interaction networks. An important class of protein-protein interaction corresponds to peptide-mediated interactions, in which a short peptide stretch from one partner interacts with a large protein surface from the other partner. Protein-peptide interactions are typically of low affinity and involved in regulatory mechanisms, dynamically reshaping protein interaction networks. Due to the relatively small interaction surface, modulation of protein-peptide interactions is feasible and highly attractive for therapeutic purposes. Unfortunately, the number of available 3D structures of protein-peptide interfaces is very limited. For typical cases where a protein-peptide structure of interest is not available, the PepSite web server can be used to predict peptide-binding spots from protein surfaces alone. The PepSite method relies on preferred peptide-binding environments calculated from a set of known protein-peptide 3D structures, combined with distance constraints derived from known peptides. We present an updated version of the web server that is orders of magnitude faster than the original implementation, returning results in seconds instead of minutes or hours. The PepSite web server is available at http://pepsite2.russelllab.org.



INTRODUCTION 

Protein-protein interactions play a key role in the regulation of all cellular functions. A subset of protein-protein interactions of particular interest are those mediated by short linear peptides (~3-10 amino acids), mostly residing in intrinsically disordered regions of proteins and often having a conserved sequence pattern, in which case they are termed short linear motifs (SLiMs) (1). Peptide-mediated interactions often regulate biological processes that require dynamic and specific responses (2). Examples of such processes include protein localization (3), endocytosis (4), post-translational modifications (5) and signaling pathways (6). The importance of peptide-mediated interactions is further demonstrated by their involvement in several human diseases, such as cherubism (7), cancer (8) and viral infections (9,10). Moreover, it has been shown that protein-peptide interactions can be modulated by chemicals or synthetic peptides for therapeutic purposes (11-13). Therefore, the ability to accurately identify and describe protein-peptide interactions in detail bears tremendous potential in furthering our understanding of complex cellular regulatory mechanisms, as well as enabling rational modulation of protein-protein interactions for therapeutic purposes.

There are several known SLiMs deposited in public databases [ELM (14), MnM (15), PROSITE (16)]. These databases, however, cover only a fraction of the estimated number of peptides and motifs actually used in cells (17). Methods to identify new instances of known motifs include ELM (14), PROSITE (16), ADAN (18) and iELM (Weatheritt et al., 2012, in this special edition), whereas others focus on finding or providing functional context for motifs [e.g. SLiMPred (19), SLiMFinder (20), DiLiMoT (21), PRATT (22) and SLiMSearch (23)]. These methods focus mainly on the peptide motif and provide little or no information regarding the protein-peptide interface. Docking has been successfully used to predict protein-peptide interfaces for short peptides of up to four residues (24). For more typical peptide lengths (5-10 residues) and an unknown binding site, docking is less feasible due to the large search space of peptide conformations and binding sites to be explored. Other approaches for predicting protein-peptide interfaces perform well with larger peptides, but limit their predictions to interactions involving certain well-characterized domains [e.g. SH3 (25), WW (26) and PDZ (27)]. Finally, there are several methods available (28) that identify functional sites on protein structures, e.g. Rate4site (29), or predict sites for generic or chemical ligand binding, e.g. SiteHound (30). These methods, however, are tailored to identifying either chemical ligand sites or general functional sites and are, therefore, limited in their performance toward predicting peptide-binding sites [see, e.g. 'Discussion' section in (31)].

To address the lack of a generic tool to predict binding of any linear peptide onto any protein structure, we previously developed the PepSite method (31). Using a large collection of protein-peptide interactions of known structure, the preferred binding environment of each peptide residue type is calculated and encoded in a so-called spatial position-specific scoring matrix (S-PSSM). Given a user-provided protein structure, PepSite scans the protein surface with the S-PSSMs and generates candidate binding sites for peptide residues. Finally, a peptide sequence of interest can be matched against the predicted residue binding sites, subject to certain distance constraints, resulting in approximate predicted peptide structures bound to the protein surface. Results from PepSite can be combined with a method such as FlexPepDock (32,33), which computes an atomic model for the peptide given an approximate binding site. A web server providing access to the initial version of PepSite has been available for the last 3 years. In this article, we present a new web server based on PepSite 2, a complete rewrite of the software in the C programming language. PepSite 2 typically generates results in seconds, as opposed to the minutes or even hours required by the initial implementation. The new PepSite version opens up many possibilities, such as the exploration of entire proteomes in large-scale in silico protein-peptide discovery experiments.

MATERIALS AND METHODS 

Spatial position-specific scoring matrices 

The PepSite approach leverages 3D structural information of protein-peptide interactions to predict new instances of peptide-binding sites given a protein surface. A data set of 405 protein-peptide complexes of known 3D structure was previously collected and used to train and validate the method (31). For each supported peptide residue type (currently all 20 standard residues plus phosphorylated Ser, Thr and Tyr), the S-PSSM capturing its preferred binding environment is constructed. Each protein heavy atom is mapped to one of 14 custom-defined atom types, and a 3D grid is constructed for each combination of peptide residue type and protein atom type. Examples of atom types include oxygen from a carbonyl group, aromatic carbon, etc. [see (31) for details]. As a first step, relative abundances for the 14 atom types on protein surfaces are calculated from a representative set of 100 protein structures, thus defining a background distribution. The representative set is defined by taking a random sample from a set of representative structures clustered at 30% sequence identity retrieved from the PDB via its REST web service interface (34). Protein surface atoms are defined as those with positive solvent accessibility scores calculated with NACCESS 2.1.1 (http://www.bioinf.manchester.ac.uk/naccess/).

For a given peptide residue type r (e.g. Pro), construction of the S-PSSM proceeds as follows. Each instance of residue r in peptides in the training set is structurally superposed to a reference r side chain using PINTS (35), and the same transformation matrix is applied to the coordinates of the corresponding interacting proteins with STAMP (36). The result is a 3D cloud of protein atoms around a reference r side chain that characterizes the preferred protein environment that interacts with r residues in peptides. For each protein atom type i (i = 1, ..., 14), a 3D grid centered at the reference r side chain is generated, with each voxel v assigned a log-odds score, i.e.

$$S_{r,i,v} = \log\left( \frac{n^{\mathrm{observed}}_{i,v}}{n^{\mathrm{expected}}_{i,v}} \right)$$

where $n^{\mathrm{observed}}_{i,v}$ is the observed number of atoms of type i in voxel v and $n^{\mathrm{expected}}_{i,v}$ is the expected number of atoms of type i, given by the relative abundance of atom type i in the background distribution times the total number of protein atoms in voxel v. Each grid contains 64 voxels with a volume of 9 Å³ each, as previously described (31).
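As an illustration only, the log-odds construction can be sketched in Python as follows. This is not the PepSite code (which is written in C): the function name, the use of the natural logarithm and the treatment of empty voxels (left undefined as `None`) are assumptions made for the sketch, since the text does not specify them.

```python
import math

def log_odds_grid(observed_counts, background_freq):
    """Return scores[i][v] = log(n_observed / n_expected) for one residue type.

    observed_counts[i][v]: number of protein atoms of type i seen in voxel v.
    background_freq[i]: relative abundance of atom type i on protein surfaces.
    """
    n_types = len(observed_counts)
    n_voxels = len(observed_counts[0])
    # Total number of protein atoms (all types) falling in each voxel.
    totals = [sum(observed_counts[i][v] for i in range(n_types))
              for v in range(n_voxels)]
    scores = []
    for i in range(n_types):
        row = []
        for v in range(n_voxels):
            observed = observed_counts[i][v]
            expected = background_freq[i] * totals[v]
            if observed == 0 or expected == 0:
                # Assumption: the score is left undefined when a count is zero.
                row.append(None)
            else:
                row.append(math.log(observed / expected))
        scores.append(row)
    return scores

# Toy example with two atom types and two voxels (made-up counts).
toy_scores = log_odds_grid([[6, 1], [2, 1]], [0.5, 0.5])
```

In the toy example, atom type 0 is enriched in the first voxel (6 observed versus 4 expected) and therefore receives a positive score, whereas atom type 1 is depleted there and receives a negative one.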

Prediction of hot spots 

Given a protein structure of interest, preferred sites for amino acid binding ('hot binding spots' or simply 'hot spots') are predicted as follows. Atomic solvent accessibility scores are calculated with NACCESS 2.1.1 and surface points are defined as the coordinates of protein atoms with positive accessibility scores. Approximate surface normals are calculated for each surface point by connecting its position to the geometric center of protein atoms within 6 Å. For each surface point s, each set of S-PSSMs is placed along the approximate normal. Each protein atom j of type i(j) that falls within the S-PSSMs is assigned to a voxel v(j) and receives a score $S_{r,i(j),v(j)}$ for each supported peptide residue type r. An aggregate score is computed for each peptide residue type r as $\sum_j S_{r,i(j),v(j)}$, where the sum is computed over all protein atoms that fall within the S-PSSMs. The distance and orientation of each S-PSSM with respect to the surface point s are then sampled so as to maximize $\sum_j S_{r,i(j),v(j)}$. Thus, for peptide residue type r, a score capturing its binding propensity is calculated for each surface point s. Surface points are then pruned by enforcing a minimum separating distance and avoiding clashes with the protein structure, keeping the points with the highest score. Finally, predicted hot spots are given by the top-scoring surface points, with the hot spot coordinates given by the center of the corresponding S-PSSMs.
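Two of the geometric steps described above, the approximate surface normal and the pruning by minimum separation, can be sketched in Python as below. This is a minimal sketch under stated assumptions rather than the PepSite implementation: the function names are hypothetical, the normal is taken to point from the local geometric center toward the surface point, the pruning is done greedily in order of decreasing score, and the clash check and the sampling of S-PSSM placement are omitted. The minimum separation is a parameter because its value is not given in the text.

```python
import math

def approximate_normal(point, atoms, radius=6.0):
    """Unit vector from the centroid of atoms within `radius` to `point`."""
    near = [a for a in atoms if math.dist(a, point) <= radius]
    if not near:
        return None  # no neighbouring atoms, normal undefined
    centroid = [sum(a[k] for a in near) / len(near) for k in range(3)]
    vec = [point[k] - centroid[k] for k in range(3)]
    norm = math.sqrt(sum(c * c for c in vec))
    if norm == 0.0:
        return None  # point coincides with the local centroid
    return tuple(c / norm for c in vec)

def prune_by_separation(scored_points, min_separation):
    """Greedy pruning: keep the best-scoring points that are at least
    `min_separation` apart. `scored_points` is a list of (score, xyz)."""
    kept = []
    for score, xyz in sorted(scored_points, key=lambda p: p[0], reverse=True):
        if all(math.dist(xyz, other) >= min_separation for _, other in kept):
            kept.append((score, xyz))
    return kept

# Toy usage with made-up coordinates (in Å) and scores.
normal = approximate_normal((2.0, 0.0, 0.0), [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
kept = prune_by_separation(
    [(5.0, (0, 0, 0)), (4.0, (1, 0, 0)), (3.0, (10, 0, 0))], min_separation=3.0)
```

In the toy usage, the second point is discarded because it lies within 3 Å of a higher-scoring point, while the distant third point is retained.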

Prediction of peptide-binding sites 

Given a list of predicted hot spots, obtained as described above, and a query sequence, PepSite employs a recursive backtracking algorithm to find all partial matches conforming to defined distance constraints. Concretely, if a peptide query is PLWPR, PepSite will exhaustively explore all possible combinations of the predicted hot spots for Pro, Leu, Trp and Arg, building an approximate 3D model of the peptide bound to the protein surface of interest, allowing for partial matches. For instance, a match could consist of PL-P-, in which three residues were assigned coordinates and scores of predicted hot spots, and the distances between all pairs of matched residues lie within ranges usually seen in peptide structures.

The distance constraints are defined as follows. For each supported peptide residue type r, a distribution of the distance between its 'active center' (a subset of the side chain) and its Cα atom is calculated from the training set, with mean denoted by $\langle d^{\mathrm{act}}_{r} \rangle$. Furthermore, Cα-Cα distance distributions are also calculated for peptide residue pairs (k, k+1), (k, k+2), etc., with mean denoted by $\langle d^{\mathrm{ca}}_{ij} \rangle$ for a residue pair (i, j). Matches calculated by PepSite have the property that, for every pair of matched residues (i, j), with residue types r(i) and r(j), the distance between their corresponding hot spot coordinates $d^{\mathrm{hs}}_{ij}$ satisfies

$$\langle d^{\mathrm{ca}}_{ij} \rangle - \alpha \left( \langle d^{\mathrm{act}}_{r(i)} \rangle + \langle d^{\mathrm{act}}_{r(j)} \rangle \right) \le d^{\mathrm{hs}}_{ij} \le \langle d^{\mathrm{ca}}_{ij} \rangle + \alpha \left( \langle d^{\mathrm{act}}_{r(i)} \rangle + \langle d^{\mathrm{act}}_{r(j)} \rangle \right),$$

where $\alpha$ is a free parameter. Minimum and maximum numbers of residues to be matched are also imposed based on known protein-peptide complexes; the minimum number of matched residues is currently set to 2, whereas the maximum is currently set to min(6, 1 + 0.67L), where L is the query length (L = 5 for the PLWPR example above).
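The matching procedure can be sketched as a small recursive backtracking search in Python. This is a hedged illustration rather than the PepSite implementation: all names are hypothetical, the mean Cα-Cα distance is assumed to depend only on the sequence separation of the two residues (`mean_ca[sep]`), the upper limit 1 + 0.67L is assumed to be truncated to an integer, and the numerical values in the usage example are made up.

```python
import math

def max_matched(length):
    """Upper limit on matched residues, min(6, 1 + 0.67 L).
    Assumption: the fractional part is truncated."""
    return min(6, int(1 + 0.67 * length))

def within_bounds(d_hs, sep, res_i, res_j, mean_ca, mean_act, alpha):
    """Distance constraint between two hot spots assigned to residues that
    are `sep` positions apart in the peptide."""
    slack = alpha * (mean_act[res_i] + mean_act[res_j])
    return mean_ca[sep] - slack <= d_hs <= mean_ca[sep] + slack

def find_matches(peptide, hot_spots, mean_ca, mean_act, alpha, min_matched=2):
    """Enumerate all partial matches of `peptide` to hot spots.

    hot_spots maps a residue letter to a list of (xyz, score) pairs.
    Each match is a list of (position, xyz, score) tuples.
    """
    limit = max_matched(len(peptide))
    results = []

    def extend(pos, chosen):
        if pos == len(peptide):
            if min_matched <= len(chosen) <= limit:
                results.append(list(chosen))
            return
        extend(pos + 1, chosen)  # leave residue `pos` unmatched
        if len(chosen) == limit:
            return
        for xyz, score in hot_spots.get(peptide[pos], []):
            # The new hot spot must be compatible with every earlier choice.
            if all(within_bounds(math.dist(xyz, cxyz), pos - cpos,
                                 peptide[pos], peptide[cpos],
                                 mean_ca, mean_act, alpha)
                   for cpos, cxyz, _ in chosen):
                chosen.append((pos, xyz, score))
                extend(pos + 1, chosen)
                chosen.pop()  # backtrack

    extend(0, [])
    return results

# Toy usage: a two-residue query with made-up hot spots and distances (Å).
toy = find_matches(
    "PL",
    {"P": [((0.0, 0.0, 0.0), 1.0)],
     "L": [((3.8, 0.0, 0.0), 2.0), ((20.0, 0.0, 0.0), 5.0)]},
    mean_ca={1: 3.8}, mean_act={"P": 1.0, "L": 1.0}, alpha=0.5)
raw_scores = [sum(score for _, _, score in match) for match in toy]
```

In the toy usage, only the Leu hot spot 3.8 Å from the Pro hot spot satisfies the constraint, so a single match is returned; its raw score is the sum of the two hot spot scores.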

The overall raw score of a match is obtained by summing the hot spot scores of the matched peptide residues (hot spot scores are described in the previous section). Considering the example above of a PL-P- match, the raw score corresponds to the first matched Pro hot spot score, plus the matched Leu hot spot score, plus the second matched Pro hot spot score. To make the scores of matches of different sizes comparable, P-values are calculated as follows. For each peptide length, raw scores are calculated by running PepSite on random peptide sequences against representative protein structures, obtained as described earlier in the text. The raw score distribution for each peptide length is then fitted to a Gumbel distribution. When matches are generated by PepSite in response to a query of interest, raw scores are converted to P-values using the corresponding fitted Gumbel distribution. Extensive benchmarks can be found in the original publication (31).
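The conversion from raw scores to P-values can be illustrated with the following Python sketch. The fitting procedure used by PepSite is not described in the text, so the method-of-moments fit below is a stand-in chosen for simplicity. The sketch also assumes a Gumbel distribution for maxima and that higher raw scores indicate better matches, so that the P-value is the upper-tail probability. Function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_gumbel_moments(raw_scores):
    """Estimate Gumbel location (mu) and scale (beta) by matching the
    sample mean and standard deviation (method of moments)."""
    mean = statistics.fmean(raw_scores)
    stdev = statistics.stdev(raw_scores)
    beta = stdev * math.sqrt(6.0) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta

def gumbel_p_value(raw_score, mu, beta):
    """Upper-tail probability P(X >= raw_score) = 1 - exp(-exp(-z)),
    with z = (raw_score - mu) / beta."""
    z = (raw_score - mu) / beta
    if z < -700.0:
        return 1.0  # avoids overflow in exp() for extremely low scores
    return -math.expm1(-math.exp(-z))

# Toy usage: fit on made-up background scores, then score a query match.
mu, beta = fit_gumbel_moments([1.0, 2.0, 3.0, 4.0, 5.0])
p = gumbel_p_value(6.0, mu, beta)
```

In practice one such fit would be stored per peptide length, and the fit matching the query length would be used to convert each raw score.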

THE PEPSITE WEB SERVER 

The PepSite web server can be accessed at http://pepsite2.russelllab.org. It is free and open to all and there is no login requirement. In a typical use of the server, a user submits a peptide sequence and a protein structure, specified either via a Protein Data Bank (PDB) code and chain or by uploading a structure in PDB format. The calculated peptide-binding spots are displayed both as a table, ordered by statistical significance, and through an interactive molecular visualization. Predicted peptide-binding sites can also be downloaded in PDB format. Molecular visualizations are generated by default using Jmol (http://www.jmol.org/), a popular Java viewer. In addition, experimental support for WebGL-based visualizations generated using VMD (37) and X3DOM (http://www.x3dom.org/) will be added in the near future.

Example application 

To illustrate the use of the PepSite server, let us consider a protein-peptide interaction of interest without an available structure. Menin is a ubiquitously expressed protein with many interacting partners, and is thus implicated in a range of biological processes (38). In particular, menin is a critical oncogenic cofactor of mixed lineage leukemia (MLL) fusion proteins, required for their leukemogenic activity, and loss of the highly specific menin-MLL interaction disrupts the oncogenic potential (39,40). Thus, this interaction is an attractive therapeutic target in acute leukemias with MLL rearrangements (38). It has been determined that two short fragments of MLL interact with menin, with the first (MBM1, residues 4-15) representing the high-affinity binding motif (41). As the structure of the menin-MBM1 interface is not available, one can use PepSite to predict the MBM1-binding site using as inputs the MBM1 peptide sequence and the recently solved Nematostella vectensis crystal structure (38). The predicted binding site lies in a large hydrophobic pocket of menin (Figure 1). Indeed, this pocket has been previously hypothesized to be the binding site for the MLL peptide, a hypothesis further supported by a series of mutagenesis experiments (38). The coarse-grained model of the menin-MBM1 binding interface generated by PepSite could be further refined using, e.g. FlexPepDock (32,33), and the resulting atomic model could then be used to rationally design a competitive inhibitor of the menin-MLL interaction for therapeutic purposes.

Figure 1. Top prediction of an MLL peptide (residues 4-15, RWRFPARP according to UniProt accession Q9Y6P1) bound to a menin structure from N. vectensis (PDB 3RE2, chain A) (38). The menin structure is displayed either as a cartoon (A) or as a surface (B). Image generated with VMD (37).

The PepSite API 

PepSite can also be run programmatically via a simple REST web service interface. The peptide sequence and PDB code and chain are encoded in the URL request, and results may be retrieved in plain text or PDB format. Protein structures may also be specified by way of a UniProt accession or identifier, in which case PepSite will attempt to map the request to a suitable PDB structure (see online documentation for details). The iELM web server (http://i.elm.eu.org; Weatheritt et al., 2012, in this special edition), which predicts protein-peptide interactions involving linear motifs annotated in ELM (14), makes use of the PepSite API.

CONCLUSION 

The PepSite web server allows users to predict peptide-binding sites, given a peptide sequence and a 3D structure of the receptor protein. The new version is orders of magnitude faster, with results visualized typically in a few seconds, thus allowing users to explore a range of hypotheses interactively, such as progressively mutating the peptide sequence and determining the effect on the predictions. The PepSite API allows the server to be accessed programmatically, which means PepSite can now be easily integrated into bioinformatics pipelines, in particular as part of large-scale in silico interaction discovery experiments. Several improvements are being implemented in order to increase the input flexibility, such as allowing users to enter linear motifs instead of complete peptide sequences, or to restrict the search to a subset of the protein structure. Improvements to molecular visualizations are also being implemented, including a WebGL-based option for modern web browsers. Another feature under development is the ability to scan overlapping windows of a protein sequence to determine the most likely peptide stretch responsible for an interaction of interest, as previously suggested (31).